PySpark: Dataframe Explode


2024-07-09 22:51 | Source: compiled from the web


The explode functions can be used to flatten array column values into rows in PySpark. This tutorial explains the following explode methods available in PySpark for flattening (exploding) an array column; click an item in the list below to jump to the respective section of the page:

explode
posexplode
explode_outer
posexplode_outer

The explode & posexplode functions will not return a record if the array is empty, so it is recommended to use explode_outer & posexplode_outer if any of the arrays may be empty or null.

Users will get the error below if they try to use multiple explodes in a single select statement.

Error: pyspark.sql.utils.AnalysisException: Only one generator allowed per select clause but found 2: explode(_2), explode(_3)

Users can visit this page to understand various approaches to exploding multiple columns.

Sample Data: The following two datasets will be used in the examples below.

Dataset 1:

data_list = [(1, [5, 6, 7]), (2, [6, 7, 8])]
df = spark.createDataFrame(sc.parallelize(data_list))
df.show()
+---+---------+
| _1|       _2|
+---+---------+
|  1|[5, 6, 7]|
|  2|[6, 7, 8]|
+---+---------+

Dataset 2:

data_list_2 = [(1, [2, 3, 4]), (2, [3, 4, 5]), (3, [])]
df_2 = spark.createDataFrame(sc.parallelize(data_list_2))
df_2.show()
+---+---------+
| _1|       _2|
+---+---------+
|  1|[2, 3, 4]|
|  2|[3, 4, 5]|
|  3|       []|
+---+---------+

➠ explode function: The explode function can be used to flatten array column values into rows.

Syntax: It takes one array column as a parameter and returns the flattened values as rows in a column named "col".

from pyspark.sql.functions import explode
explode(array_column)

Example: The explode function takes an array column as input and returns a column named "col" if it is not aliased to the required column name.

from pyspark.sql.functions import explode
df_update = df.select("_1", explode("_2"))
df_update.show()
+---+---+
| _1|col|
+---+---+
|  1|  5|
|  1|  6|
|  1|  7|
|  2|  6|
|  2|  7|
|  2|  8|
+---+---+

➠ posexplode function: The posexplode function works like explode to flatten array column values into rows, but it also returns the position of each value within the array as an additional column.

Syntax: It takes one array column as a parameter and returns the flattened values as rows in a column named "col", along with the position of each value in the array.

from pyspark.sql.functions import posexplode
posexplode(array_column)

Example: The posexplode function takes an array column as input and returns two columns: "pos", holding the position of the value in the array, and "col", holding the value itself.

from pyspark.sql.functions import posexplode
df_update = df.select("_1", posexplode("_2"))
df_update.show()
+---+---+---+
| _1|pos|col|
+---+---+---+
|  1|  0|  5|
|  1|  1|  6|
|  1|  2|  7|
|  2|  0|  6|
|  2|  1|  7|
|  2|  2|  8|
+---+---+---+

➠ explode_outer function: The explode_outer function works exactly like explode; the only difference is that explode will not return a record when the array is empty, while explode_outer will return such records as well.

Syntax: It takes one array column as a parameter and returns the flattened values as rows in a column named "col".

from pyspark.sql.functions import explode_outer
explode_outer(array_column)

Example: The explode_outer function takes an array column as input and returns a column named "col" if it is not aliased to the required column name.

from pyspark.sql.functions import explode_outer
df_update = df_2.select("_1", explode_outer("_2"))
df_update.show()
+---+----+
| _1| col|
+---+----+
|  1|   2|
|  1|   3|
|  1|   4|
|  2|   3|
|  2|   4|
|  2|   5|
|  3|null| # Null was returned
+---+----+

➠ posexplode_outer function: The posexplode_outer function works exactly like posexplode; the only difference is that posexplode will not return a record when the array is empty, while posexplode_outer will return such records as well.

Syntax: It takes one array column as a parameter and returns the flattened values as rows in a column named "col", along with the position of each value in the array.

from pyspark.sql.functions import posexplode_outer
posexplode_outer(array_column)

Example: The posexplode_outer function takes an array column as input and returns two columns: "pos", holding the position of the value in the array, and "col", holding the value itself.

from pyspark.sql.functions import posexplode_outer
df_update = df_2.select("_1", posexplode_outer("_2"))
df_update.show()
+---+----+----+
| _1| pos| col|
+---+----+----+
|  1|   0|   2|
|  1|   1|   3|
|  1|   2|   4|
|  2|   0|   3|
|  2|   1|   4|
|  2|   2|   5|
|  3|null|null| # Null was returned
+---+----+----+


